404 research outputs found

External Lexical Information for Multilingual Part-of-Speech Tagging

Author: Sagot Benoît
Publication date: 01/06/2016

Morphosyntactic lexicons and word vector representations have both proven useful for improving the accuracy of statistical part-of-speech taggers. Here we compare the performance of four systems on datasets covering 16 languages, two of these systems being feature-based (MEMMs and CRFs) and two of them being neural-based (bi-LSTMs). We show that, on average, all four approaches perform similarly and reach state-of-the-art results. Yet better performance is obtained with our feature-based models on lexically richer datasets (e.g. for morphologically rich languages), whereas neural-based results are higher on datasets with less lexical variability (e.g. for English). These conclusions hold in particular for the MEMM models relying on our system MElt, which benefited from newly designed features. This shows that, under certain conditions, feature-based approaches enriched with morphosyntactic lexicons are competitive with neural methods.

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Hal-Diderot

DeLex, a freely-available, large-scale and linguistically grounded morphological lexicon for German

Author: Sagot Benoît
Publication venue: HAL CCSD
Publication date: 26/05/2014

We introduce DeLex, a freely available, large-scale and linguistically grounded morphological lexicon for German developed within the Alexina framework. We extracted lexical information from the German Wiktionary and developed a morphological inflection grammar for German, based on a linguistically sound model of inflectional morphology. Although the development of DeLex involved some manual work, we show that it represents a good tradeoff between development cost, lexical coverage and resource accuracy.

INRIA a CCSD electronic archive server

Hal-Diderot

Comparing Complexity Measures

Author: Sagot Benoît
Publication venue: HAL CCSD
Publication date: 22/02/2013

INRIA a CCSD electronic archive server

Hal-Diderot

Étiquetage multilingue en parties du discours avec MElt (Multilingual Part-of-Speech Tagging with MElt)

Author: Sagot Benoît
Publication venue: HAL CCSD
Publication date: 04/07/2016

We describe recent developments of MElt, a discriminative part-of-speech tagging system. MElt is designed to make optimal use of information provided by external lexicons in order to improve its performance over models trained solely on annotated corpora. We have trained MElt on more than 40 datasets covering over 30 languages. Compared with the state-of-the-art system MarMoT, MElt's results are slightly worse on average when no external lexicon is used, but slightly better when such resources are available, resulting in state-of-the-art taggers for a number of languages.

INRIA a CCSD electronic archive server

Hal-Diderot

Building a free French wordnet from multilingual resources

Author: Fišer Darja
Sagot Benoît
Publication venue: HAL CCSD
Publication date: 31/05/2008

This paper describes the automatic construction of a freely available wordnet for French (WOLF) based on Princeton WordNet (PWN), using various multilingual resources. Polysemous words were handled with an approach in which a parallel corpus for five languages was word-aligned and the extracted multilingual lexicon was disambiguated with the existing wordnets for these languages. For monosemous words, on the other hand, a bilingual approach sufficed to acquire equivalents; the bilingual lexicons were extracted from Wikipedia and thesauri. The results obtained from each resource were merged and ranked according to the number of resources yielding the same literal. The merged wordnet was evaluated automatically against the French WordNet (FREWN), and a sample of the generated synsets was also evaluated manually. The precision obtained shows that the approach is very promising, and applications using the created wordnet are already planned.

CiteSeerX

INRIA a CCSD electronic archive server

Hal-Diderot

Verbes de citation et Tables du Lexique-Grammaire (Quotation Verbs and the Lexicon-Grammar Tables)

Author: Danlos Laurence
Sagot Benoît
Publication venue: HAL CCSD
Publication date: 01/09/2010

This article sets out to study systematically how and where the verbs that can head a quotation parenthetical (incise de citation) are distributed across the simple-verb tables of the Lexicon-Grammar (LG). At present, only Table 9 encodes this property (column 'P', V N0 à N2).

INRIA a CCSD electronic archive server

Hal-Diderot

Could Greek and Italic share a same Indo-European substratum?

Author: Garnier Romain
Sagot Benoît
Publication venue: HAL CCSD
Publication date: 27/07/2015

Greek and Latin have developed from their common Proto-Indo-European (PIE) ancestor in distinct ways, resulting in two languages that exhibit very different features, in particular regarding phonology and Wortbildung. Moreover, the Greek lexicon has long been recognised for its huge proportion of non-inherited words, among which it is difficult to draw a clear distinction between substrata and loan words. Several of the languages that contributed to shaping the Greek lexicon are Indo-European. Among the Indo-European contributors to the non-inherited Greek lexicon, we tentatively identify a language that shares phonetic and morphological features with substratic elements attested in Italic, and possibly articulatory properties of Latin itself. We shall review five phonetic features of this language: (i) voiceless reflexes of PIE voiced aspirated stops; (ii) the anticipation of nasals resembling lex-unda in Latin but generalised to labial stops, such that VCnV > VnGV with lenition of the consonant; (iii) a velarised /ł/ (viz. l pinguis) which can trigger an anaptyctic -ŏ- or -ŭ-; (iv) apparent voice alternations that follow similar patterns to the Verner law in Germanic; (v) the metathesis of -r-, such that CVrC > CrVC. Our study also unveils morphological peculiarities of this language: (a) the frequent use of elsewhere poorly attested labial morphs, leading to nouns of the form *CóC-Po- and adjectives of the form *CoC-Pó-; (b) the frequent use of a prefix *eǵhs- (cf. Lat. ex-, Gr. ἐξ-) reflected as a simple *s-; (c) the frequent occurrence of action nouns built with the well-known *CóC-no- pattern.

HAL-UNILIM

HAL Clermont Université

INRIA a CCSD electronic archive server

Hal-Diderot

Data-driven Synset Induction and Disambiguation for Wordnet Development

Author: Apidianaki Marianna
Sagot Benoît
Publication venue: Springer Science and Business Media LLC
Publication date: 01/11/2014

Automatic methods for wordnet development in languages other than English generally exploit information found in Princeton WordNet (PWN) and translations extracted from parallel corpora. A common approach consists in preserving the structure of PWN and transferring its content into new languages using alignments, possibly combined with information extracted from multilingual semantic resources. Even if the role of PWN remains central in this process, these automatic methods offer an alternative to the manual elaboration of new wordnets. However, their limited coverage has a strong impact on that of the resulting resources. Following this line of research, we apply a cross-lingual word sense disambiguation method to wordnet development. Our approach exploits the output of a data-driven sense induction method that generates sense clusters in new languages, similar to wordnet synsets, by identifying word senses and relations in parallel corpora. We apply our cross-lingual word sense disambiguation method to the task of enriching a French wordnet resource, the WOLF, and show how it can be efficiently used for increasing its coverage. Although our experiments involve the English-French language pair, the proposed methodology is general enough to be applied to the development of wordnet resources in other languages for which parallel corpora are available. Finally, we show how the disambiguation output can serve to reduce the granularity of new wordnets and the degree of polysemy present in PWN.

INRIA a CCSD electronic archive server

Hal-Diderot

Normalisation de textes par analogie: le cas des mots inconnus

Author: Baranes Marion
Sagot Benoît
Publication venue: HAL CCSD
Publication date: 01/07/2014

Analogy-based Text Normalization: the case of unknown words. In this paper, we describe and evaluate a system for improving the quality of noisy texts containing non-word errors. It is meant to be integrated into a full information extraction architecture, and aims at improving its results. For each word unknown to a reference lexicon which is neither a named entity nor a neologism, our system suggests one or several normalization candidates (any known word which has the same lemma as the spell-corrected form is a valid candidate). For this purpose, we use an analogy-based approach for acquiring normalization rules and use them in the same way as lexical spelling correction rules.

INRIA a CCSD electronic archive server

Hal-Diderot

Automated Error Detection in Digitized Cultural Heritage Documents

Author: Gábor Kata
Sagot Benoît
Publication venue: HAL CCSD
Publication date: 26/04/2014

The work reported in this paper aims at performance optimization in the digitization of documents pertaining to the cultural heritage domain. A hybrid method is proposed, combining statistical classification algorithms and linguistic knowledge to automate post-OCR error detection and correction. The current paper deals with the integration of linguistic modules and their impact on error detection.

INRIA a CCSD electronic archive server

Hal-Diderot